# Load package
library(tidyverse)
# Read in the Dear Abby data
abby <- read_csv("https://mac-stat.github.io/data/dear_abby.csv")
Univariate visualization and summaries (Notes)
STAT 155
Notes
Learning goals
By the end of this lesson, you should be able to:
- Describe what a case (or unit of analysis) represents in a dataset.
- Describe what a variable represents in a dataset.
- Identify whether a variable is categorical or quantitative and what summarizations and visualizations are appropriate for that variable
- Write R code to read in data and to summarize and visualize a single variable at a time.
- Interpret key features of barplots, boxplots, histograms, and density plots
- Describe information about the distribution of a quantitative variable using the concepts of shape, center, spread, and outliers
- Relate summary statistics of data to the concepts of shape, center, spread, and outliers
Readings and videos
Choose either the reading or the videos to go through before class.
- Reading: Sections 2.1-2.4, 2.6 in the STAT 155 Notes
- Videos:
File organization: Save this file in the “Activities” subfolder of your “STAT155” folder.
Exercises
Guiding question: What anxieties have been on Americans’ minds over the decades?
Context: Dear Abby is America’s longest running advice column. Started in 1956 by Pauline Phillips under the pseudonym Abigail van Buren, the column continues to this day under the stewardship of her daughter Jeanne. Each column features one or more letters to Abby from anonymous individuals, all signed with a pseudonym. Abby’s response follows each letter.
In 2018, the data journalism site The Pudding published a visual article called 30 Years of American Anxieties in which the authors explored themes in Dear Abby letters from 1985 to 2017. (We only have the questions, not Abby’s responses.) The codebook is available here.
Exercise 1: Get curious
- Hypothesize with each other: what themes do you think might come up often in Dear Abby letters?
- After brainstorming, take a quick glance at the original article from The Pudding to see what themes they explored.
- Go to the very end of the Pudding article to the section titled “Data and Method”. In thinking about the who, what, when, where, why, and how of data context, what concerns/limitations surface with regards to using this data to learn about Americans’ concerns over the decades?
Exercise 2: Importing and getting to know the data
First, in the Console pane of RStudio, run the following command to install some necessary packages (you will need to do this any time you are installing a new package):
install.packages("tidyverse")
Now, in the Quarto pane, run the following code chunk to load the package and load a dataset. You can either click the green arrow in the top right of the code chunk, or put your cursor in the code chunk and hit Ctrl+Alt+C (on Windows/Linux) or Command+Option+C (on Mac).
If it runs successfully, you should see the following output appear in the Console pane:
> # Load package
> library(tidyverse)
>
> # Read in the Dear Abby data
> abby <- read_csv("https://mac-stat.github.io/data/dear_abby.csv")
Rows: 20034 Columns: 11
── Column specification ────────────────────────────
Delimiter: ","
chr (4): day, url, title, question_only
dbl (7): year, month, letterId, afinn_overall, a...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
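The column specification above reports the type that read_csv() guessed for each column (chr for text, dbl for numbers). As a minimal sketch of the same behavior, the example below reads a tiny made-up CSV supplied as literal text. The toy object and its values are hypothetical and are not part of the Dear Abby data, and passing literal text through I() assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical two-row CSV, invented for illustration only.
# I() tells read_csv() that this is literal data rather than a file path.
toy <- read_csv(
  I("year,title\n1985,First letter\n1986,Second letter\n"),
  show_col_types = FALSE  # quiets the column specification message
)

# spec() shows the column types that read_csv() guessed
spec(toy)
```

Here show_col_types = FALSE suppresses the message, and spec() retrieves the same information on request.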
- Click on the Environment tab (generally in the upper right hand pane in RStudio). Then click the abby line. The abby data will pop up as a separate pane (like viewing a spreadsheet). Check it out.
- In this tidy dataset, what is the unit of observation? That is, what is represented in each row of the dataset?
- What term do we use for the columns of the dataset?
- Try out each function below. Identify what each function tells you about the abby data and note this in the ???:
# ??? [what do both numbers mean?]
dim(abby)
## [1] 20034 12
# ???
nrow(abby)
## [1] 20034
# ???
ncol(abby)
## [1] 12
# ???
head(abby)
## # A tibble: 6 × 12
## year month day url title letterId question_only question_id afinn_overall
## <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 1985 1 01 proq… WOMA… 1 "i have been… 1 -26
## 2 1985 1 01 proq… WOMA… 1 "this is for… 2 -4
## 3 1985 1 02 proq… LAME… 1 "our 16-year… 1 1
## 4 1985 1 03 proq… 'NOR… 1 "i was a hap… 1 -3
## 5 1985 1 04 proq… IT'S… 1 "you be the … 1 -2
## 6 1985 1 04 proq… IT'S… 1 "a further w… 2 -1
## # ℹ 3 more variables: afinn_pos <dbl>, afinn_neg <dbl>, bing_pos <dbl>
# ???
names(abby)
## [1] "year" "month" "day" "url"
## [5] "title" "letterId" "question_only" "question_id"
## [9] "afinn_overall" "afinn_pos" "afinn_neg" "bing_pos"
- [OPTIONAL] If you’re not sure how exactly to use a function, you can pull up a built-in help page with information about the arguments a function takes (i.e., what goes inside the parentheses), and the output it produces. To do this, click inside the Console pane, and enter ?function_name. For example, to pull up a help page for the dim() function, we can type ?dim and hit Enter. Try pulling up the help page for the read_csv() function we used to load the dataset.
Exercise 3: Preparing to summarize and visualize the data
In the next exercises, we will be exploring themes in the Dear Abby questions and the overall “mood” or sentiment of the questions. Before continuing, read the codebook for this dataset for some context about sentiment analysis, which gives us a measure of the mood/sentiment of a text.
- What sentiment variables do we have in the dataset? Are they quantitative or categorical?
- If we were able to create a theme variable that took values like “friendship”, “marriage”, and “relationships”, would theme be quantitative or categorical?
- What visualizations are appropriate for looking at the distribution of a single quantitative variable? What about a single categorical variable?
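One way to start on questions like these is to check how R has stored each variable. The sketch below uses a small made-up data frame. The name toy_letters and its values are hypothetical and are not the Dear Abby data.

```r
library(tibble)

# Hypothetical toy data, invented for illustration only
toy_letters <- tibble(
  theme = c("friendship", "marriage", "marriage"),
  score = c(-3, 1, -12)
)

# class() of each column: "character" for text, "numeric" for numbers
sapply(toy_letters, class)
```

The storage type is only a hint. A column stored as numbers (an ID, for example) is not necessarily quantitative, so the codebook still matters.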
Exercise 4: Exploring themes in the letters
The dplyr package provides many useful functions for managing data (like creating new variables, summarizing information). The stringr package provides tools for working with strings (text). We’ll use these packages to search for words in the questions in order to (roughly) identify themes/subjects.
The code below searches for words related to mothers, fathers, marriage, and money and combines them into a single theme variable.
- Inside mutate(), the line moms = ifelse(str_detect(question_only, "mother|mama|mom"), "mom", "no mom") creates a new variable called moms. If any of the text “mother”, “mama”, or “mom” (which covers “mommy”) is found, then the variable takes the value “mom”. Otherwise, the variable takes the value “no mom”.
- The dads, marriage, and money variables are created similarly.
- The themes = str_c(moms, dads, marriage, money, sep = "|") line takes the 4 created variables and combines the text of those variables separated with a |. For example, one value of the themes variable is “mom|no dad|no marriage|no money” (a question that contains words about moms but not dads, marriage, or money).
library(dplyr)
library(stringr)
abby <- abby %>%
  mutate(
    moms = ifelse(str_detect(question_only, "mother|mama|mom"), "mom", "no mom"),
    dads = ifelse(str_detect(question_only, "father|papa|dad"), "dad", "no dad"),
    marriage = ifelse(str_detect(question_only, "marriage|marry|married"), "marriage", "no marriage"),
    money = ifelse(str_detect(question_only, "money|finance"), "money", "no money"),
    themes = str_c(moms, dads, marriage, money, sep = "|")
  )
- Modify the code above however you wish to replace themes (e.g., replace “moms” with something else) or add new themes to search for. If you want to add a new subject to search for, copy and paste a line for an existing subject above the themes line, and modify the code like this:
  - If your subject is captured by multiple words: YOUR_SUBJECT = ifelse(str_detect(question_only, "WORD1|WORD2|ETC"), "SUBJECT", "NO SUBJECT"),
  - If your subject is captured by a single word: YOUR_SUBJECT = ifelse(str_detect(question_only, "WORD"), "SUBJECT", "NO SUBJECT"),
  - Try to have no more than 6 subjects; otherwise we’ll have too many themes, which will complicate exploration.
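Before choosing your own search words, it can help to try the pattern matching on a few short strings. The sketch below applies the same str_detect(), ifelse(), and str_c() steps to three made-up questions. The toy data are hypothetical and are not taken from the Dear Abby letters.

```r
library(dplyr)
library(stringr)

# Hypothetical toy questions, invented for illustration only
toy <- tibble(
  question_only = c(
    "my mommy and my dad disagree",
    "wait a moment before you marry",
    "my father lent me money"
  )
)

toy <- toy %>%
  mutate(
    moms = ifelse(str_detect(question_only, "mother|mama|mom"), "mom", "no mom"),
    dads = ifelse(str_detect(question_only, "father|papa|dad"), "dad", "no dad"),
    themes = str_c(moms, dads, sep = "|")
  )

# The second question is labeled "mom" only because "moment" contains "mom"
toy %>%
  count(themes)
```

Because str_detect() matches text anywhere inside a word, short search words can pick up unrelated words. This is one reason the themes are only rough.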
- The code below makes a barplot of the themes variable using the ggplot2 visualization package. Before making the plot, make note of what you expect the plot might look like. (This might be hard; just do your best!) Then compare to what you observe when you run the code chunk to make the plot. (Clearly defining your expectations first is good scientific practice to avoid confirmation bias.)
# Load package
library(ggplot2)
# barplot
ggplot(abby, aes(x = themes)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
- We can follow up on the barplot with a simple numerical summary. Whereas the ggplot2 package is great for visualizations, dplyr is great for numerical summaries. The code below constructs a table of the number of questions with each theme. Make sure that these numerical summaries match up with what you saw in the barplot.
# Construct a table of counts
abby %>%
  count(themes)
## # A tibble: 16 × 2
## themes n
## <chr> <int>
## 1 mom|dad|marriage|money 67
## 2 mom|dad|marriage|no money 567
## 3 mom|dad|no marriage|money 109
## 4 mom|dad|no marriage|no money 906
## 5 mom|no dad|marriage|money 121
## 6 mom|no dad|marriage|no money 839
## 7 mom|no dad|no marriage|money 293
## 8 mom|no dad|no marriage|no money 2462
## 9 no mom|dad|marriage|money 41
## 10 no mom|dad|marriage|no money 350
## 11 no mom|dad|no marriage|money 96
## 12 no mom|dad|no marriage|no money 760
## 13 no mom|no dad|marriage|money 360
## 14 no mom|no dad|marriage|no money 2967
## 15 no mom|no dad|no marriage|money 865
## 16 no mom|no dad|no marriage|no money 9231
- Before proceeding, let’s break down the plotting code above. Run each chunk to see how the two lines of code above build up the plot in “layers”. Add comments (on the lines starting with #) to document what you notice.
# ???
ggplot(abby, aes(x = themes))
# ???
ggplot(abby, aes(x = themes)) +
  geom_bar()
# ???
ggplot(abby, aes(x = themes)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
# ???
ggplot(abby, aes(x = themes)) +
  geom_bar() +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
Exercise 5: Exploring sentiment
We’ll look at the distribution of the afinn_overall sentiment variable and associated summary statistics.
- The code below creates a boxplot of this variable. In the comment, make note of how this code is similar to the code for the barplot above. As in the previous exercise, before running the code chunk to create the plot, make note of what you expect the boxplot to look like.
# ???
ggplot(abby, aes(x = afinn_overall)) +
  geom_boxplot()
- Challenge: Using the code for the barplot and boxplot as a guide, try to make a histogram and a density plot of the afinn_overall sentiment scores.
- What information is given by the tallest bar of the histogram?
- How would you describe the shape of the distribution?
# Histogram
# Density plot
- We can compute summary statistics (numerical summaries) for a quantitative variable using the summary() function or with the summarize() function from the dplyr package. (1st Qu. and 3rd Qu. stand for first and third quartile.) After inspecting these summaries, look back to your boxplot, histogram, and density plot. Which plots show which summaries most clearly?
# Summary statistics
# Using summary() - convenient for computing many summaries in one command
# Does not show the standard deviation
summary(abby$afinn_overall)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -140.000 -5.000 -1.000 -1.358 3.000 77.000 697
# Using summarize() from dplyr
# Note that we use %>% to pipe the data into the summarize() function
# We need to use na.rm = TRUE because there are missing values (NAs)
abby %>%
  summarize(mean(afinn_overall, na.rm = TRUE), median(afinn_overall, na.rm = TRUE), sd(afinn_overall, na.rm = TRUE))
## # A tibble: 1 × 3
## mean(afinn_overall, na.rm = TR…¹ median(afinn_overall…² sd(afinn_overall, na…³
## <dbl> <dbl> <dbl>
## 1 -1.36 -1 8.20
## # ℹ abbreviated names: ¹`mean(afinn_overall, na.rm = TRUE)`,
## # ²`median(afinn_overall, na.rm = TRUE)`, ³`sd(afinn_overall, na.rm = TRUE)`
- Write a good paragraph describing the information in the histogram (or density plot) by discussing shape, center, spread, and outliers. Incorporate the numerical summaries from the previous part.
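The summarize() output above has long, abbreviated column names because the summaries were not named. As a minimal sketch, naming each summary inside summarize() gives shorter column names, and the same call can include a measure of spread such as the IQR. The toy scores below are hypothetical and are not the Dear Abby sentiment scores.

```r
library(dplyr)

# Hypothetical toy scores, invented for illustration only (one is missing)
toy <- tibble(score = c(-5, -1, 0, 3, NA))

# Without na.rm = TRUE, a single NA makes the mean NA
mean(toy$score)

# Naming each summary gives short, readable column names
toy_summary <- toy %>%
  summarize(
    mean_score = mean(score, na.rm = TRUE),
    median_score = median(score, na.rm = TRUE),
    sd_score = sd(score, na.rm = TRUE),
    iqr_score = IQR(score, na.rm = TRUE)
  )
toy_summary
```

The names mean_score, median_score, sd_score, and iqr_score are arbitrary choices; any valid names would work.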
Exercise 6: Box plots vs. histograms vs. density plots
We took 3 different approaches to plotting the quantitative afinn_overall variable above. They all have pros and cons.
- What is one pro about the boxplot in comparison to the histogram and density plot?
- What is one con about the boxplot in comparison to the histogram and density plots?
- In this example, which plot do you prefer and why?
Exercise 7: Explore outliers
Given that Dear Abby is an advice column, it seems natural that the sentiment of the questions would lean more negative. What’s going on with the questions that have particularly positive sentiments?
We can use the filter() function in the dplyr package to look at these questions. Based on the plots of afinn_overall that you made in Exercise 5, pick a threshold for the afinn_overall variable. We’ll say that questions with an overall sentiment score above this threshold are high outliers. Fill in this number where it says YOUR_THRESHOLD below.
abby %>%
  filter(afinn_overall > YOUR_THRESHOLD) %>%
  pull(question_only)
## Error in `filter()`:
## ℹ In argument: `afinn_overall > YOUR_THRESHOLD`.
## Caused by error:
## ! object 'YOUR_THRESHOLD' not found
What do you notice? Why might these questions have such high sentiment scores?
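The error above appears only because YOUR_THRESHOLD has not yet been replaced with a number. As a minimal sketch of how filter() and pull() work together once a threshold is chosen, the example below uses made-up questions and made-up scores. The toy data, the scores (which were not computed with AFINN), and the cutoff of 10 are all hypothetical.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  question_only = c(
    "thank you for the wonderful advice",
    "i am worried and upset",
    "what a lovely, happy surprise"
  ),
  afinn_overall = c(12, -6, 15)
)

threshold <- 10  # hypothetical cutoff; choose your own from your plots

# filter() keeps the rows above the threshold; pull() extracts the text column
high <- toy %>%
  filter(afinn_overall > threshold) %>%
  pull(question_only)
high
```

Note that filter() also drops rows where the condition is NA, so questions with a missing afinn_overall score will not appear in the result.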
Exercise 8: Returning to our context, looking ahead
In this activity, we explored data on Dear Abby questions, with a focus on exploring a single variable at a time.
- In big picture terms, what have we learned about Dear Abby questions?
- What further curiosities do you have about the data?
Exercise 9: Different ways to think about data visualization
In working with and visualizing data, it’s important to keep in mind what a data point represents. It can reflect the experience of a real person. It might reflect the sentiment in a piece of art. It might reflect history. We’ve taken one very narrow and technical approach to data visualization. Check out the following examples, and write some notes about anything you find interesting.
Exercise 10: Rendering your work
Save this file, and then click the “Render” button in the menu bar for this pane (blue arrow pointing right). This will create an HTML file containing all of the directions, code, and responses from this activity. A preview of the HTML will appear in the browser.
- Scroll through and inspect the document to see how your work was translated into this HTML format. Neat!
- Close the browser tab.
- Go to the “Background Jobs” pane in RStudio and click the Stop button to end the rendering process.
- Navigate to your “Activities” subfolder within your “STAT155” folder and locate the HTML file. You can open it again in your browser to double check.
Reflection
Go to the top of this file and review the learning objectives for this lesson. Which objectives do you have a good handle on, are at least familiar with, or are struggling with? What feels challenging right now? What are some wins from the day?
Response: Put your response here.
Additional Practice
If you have time and want additional practice, try out the following exercises.
Exercise 11: Read in and get to know the weather data
Daily weather data are available for 3 locations in Perth, Australia.
- View the codebook here.
- Complete the code below to read in the data.
# Replace the ??? with your own name for the weather data
# Replace the ___ with the correct function
??? <- ___("https://mac-stat.github.io/data/weather_3_locations.csv")
## Error in parse(text = input): <text>:3:5: unexpected assignment
## 2: # Replace the ___ with the correct function
## 3: ??? <-
## ^
Exercise 12: Exploring the data structure
Check out the basic features of the weather data.
# Examine the first six cases
# Find the dimensions of the data
What does a case represent in this data?
Exercise 13: Exploring rainfall
The raintoday variable contains information about rainfall.
- Is this variable quantitative or categorical?
- Create an appropriate visualization, and compute appropriate numerical summaries.
- What do you learn about rainfall in Perth?
# Visualization
# Numerical summaries
Exercise 14: Exploring temperature
The maxtemp variable contains information on the daily high temperature.
- Is this variable quantitative or categorical?
- Create an appropriate visualization, and compute appropriate numerical summaries.
- What do you learn about high temperatures in Perth?
# Visualization
# Numerical summaries
Exercise 15: Customizing! (CHALLENGE)
Though you will naturally absorb some R code throughout the semester, being an effective statistical thinker and “programmer” does not require that we memorize all code. That would be impossible! Instead, using the foundation you built today, do some digging online to learn how to customize your visualizations.
- For the histogram below, add a title and more meaningful axis labels. Specifically, title the plot “Distribution of max temperatures in Perth”, change the x-axis label to “Maximum temperature” and y-axis label to “Number of days”. HINT: Do a Google search for something like “add axis labels ggplot”.
# Add a title and axis labels
ggplot(weather, aes(x = maxtemp)) +
  geom_histogram()
## Error: object 'weather' not found
- Adjust the code below in order to color the bars green. NOTE: Color can be an effective tool, but here it is simply gratuitous.
# Make the bars green
ggplot(weather, aes(x = raintoday)) +
  geom_bar()
## Error: object 'weather' not found
- Check out the ggplot2 cheat sheet. Try making some of the other kinds of univariate plots outlined there.
- What else would you like to change about your plot? Try it!
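As one possible starting point for this kind of customization, the sketch below adds a title, axis labels, and a fill color to a histogram. It uses the mtcars data that ships with R rather than the weather data, and the labels and colors are arbitrary choices for illustration.

```r
library(ggplot2)

# mtcars ships with R; it stands in here for any data with a quantitative variable
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of fuel efficiency",
    x = "Miles per gallon",
    y = "Number of cars"
  )
p
```

In geom_histogram(), fill sets the color inside the bars and color sets their outline. Because both are fixed values rather than variables, they go outside aes().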
Done!
- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer – check that your work translated correctly; and (3) Outside RStudio, navigate to your ‘Activities’ subfolder within your ‘STAT155’ folder and locate the HTML file – you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the ‘Stop’ button in the ‘Background Jobs’ pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework!